Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enhancement] Add schema validation and placeholders to index mappings #3240

Conversation

pyek-bot
Copy link
Contributor

@pyek-bot pyek-bot commented Nov 27, 2024

Description

Added schema validation and placeholders to index mappings

Related Issues

Resolves #2951

Check List

  • New functionality includes testing.
  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Pavan Yekbote <[email protected]>
Signed-off-by: Pavan Yekbote <[email protected]>
Signed-off-by: Pavan Yekbote <[email protected]>
Signed-off-by: Pavan Yekbote <[email protected]>
…ums rather than use their own mappings

Signed-off-by: Pavan Yekbote <[email protected]>
Signed-off-by: Pavan Yekbote <[email protected]>
Signed-off-by: Pavan Yekbote <[email protected]>
Signed-off-by: Pavan Yekbote <[email protected]>
Signed-off-by: Pavan Yekbote <[email protected]>
Signed-off-by: Pavan Yekbote <[email protected]>
Signed-off-by: Pavan Yekbote <[email protected]>
Signed-off-by: Pavan Yekbote <[email protected]>
Signed-off-by: Pavan Yekbote <[email protected]>
Signed-off-by: Pavan Yekbote <[email protected]>
Signed-off-by: Pavan Yekbote <[email protected]>
@pyek-bot pyek-bot had a problem deploying to ml-commons-cicd-env-require-approval December 13, 2024 16:07 — with GitHub Actions Failure
@pyek-bot pyek-bot temporarily deployed to ml-commons-cicd-env-require-approval December 13, 2024 16:07 — with GitHub Actions Inactive
Zhangxunmt
Zhangxunmt previously approved these changes Dec 13, 2024
@pyek-bot pyek-bot temporarily deployed to ml-commons-cicd-env-require-approval December 13, 2024 17:39 — with GitHub Actions Inactive
public static String getMappingFromFile(String path) throws IOException {
URL url = IndexUtils.class.getClassLoader().getResource(path);
if (url == null) {
throw new IOException("Resource not found: " + path);
}

String mapping = Resources.toString(url, Charsets.UTF_8).trim();
if (mapping.isEmpty() || !StringUtils.isJson(mapping)) {
throw new IllegalArgumentException("Invalid or non-JSON mapping at: " + path);
if (mapping.isEmpty()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to have this check here considering we are going to check mapping.isBlank() in the replacePlaceholders method. Seems like redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i added this check here since this method can be used on its own, incase used in other places, i believe it is safer to have this additional check

}

String placeholderMapping = Resources.toString(url, Charsets.UTF_8);
mapping = mapping.replace(placeholder.getKey(), placeholderMapping);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are creating string in every replace. May be we could use a StringBuilder to have in-place replacement?

May be something like this:

public static String replacePlaceholders(String mapping) throws IOException {
    if (mapping == null || mapping.isBlank()) {
        throw new IllegalArgumentException("Mapping cannot be null or empty");
    }

    // Preload resources into memory to avoid redundant I/O
    Map<String, String> loadedPlaceholders = new HashMap<>();
    for (Map.Entry<String, String> placeholder : MAPPING_PLACEHOLDERS.entrySet()) {
        URL url = IndexUtils.class.getClassLoader().getResource(placeholder.getValue());
        if (url == null) {
            throw new IOException("Resource not found: " + placeholder.getValue());
        }
        // Load and cache the content
        loadedPlaceholders.put(placeholder.getKey(), Resources.toString(url, Charsets.UTF_8));
    }

    // Use StringBuilder for efficient in-place replacements
    StringBuilder result = new StringBuilder(mapping);
    for (Map.Entry<String, String> entry : loadedPlaceholders.entrySet()) {
        String placeholder = entry.getKey();
        String replacement = entry.getValue();

        // Replace all occurrences of the placeholder
        int index;
        while ((index = result.indexOf(placeholder)) != -1) {
            result.replace(index, index + placeholder.length(), replacement);
        }
    }

    return result.toString();
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, thanks for the suggestion, saves on the I/O and is more efficient!

@@ -336,4 +347,25 @@ public static JsonObject getJsonObjectFromString(String jsonString) {
return JsonParser.parseString(jsonString).getAsJsonObject();
}

public static void validateSchema(String schemaString, String instanceString) throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add try catch block to catch any exceptions like:

catch (JsonProcessingException e) {
        throw new IllegalArgumentException("Invalid JSON format: " + e.getMessage(), e);
    } catch (Exception e) {
        throw new OpenSearchParseException("Schema validation failed: " + e.getMessage(), e);
    }


// Validate JSON node against the schema
Set<ValidationMessage> errors = schema.validate(jsonNode);
if (!errors.isEmpty()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about this:

if (!errors.isEmpty()) {
            String errorMessage = errors.stream()
                                        .map(ValidationMessage::getMessage)
                                        .collect(Collectors.joining(", "));
            throw new OpenSearchParseException(
                "Validation failed: " + errorMessage + 
                " for instance: " + instanceString + 
                " with schema: " + schemaString
            );
        }

@@ -1,6 +1,6 @@
{
"_meta": {
"schema_version": "1"
"schema_version": 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, I see that JSON files in the repository are named using snake_case. Let's follow the same convention.

@pyek-bot pyek-bot temporarily deployed to ml-commons-cicd-env-require-approval December 27, 2024 19:36 — with GitHub Actions Inactive
@pyek-bot pyek-bot had a problem deploying to ml-commons-cicd-env-require-approval December 27, 2024 19:36 — with GitHub Actions Failure
@pyek-bot
Copy link
Contributor Author

@dhrubo-os addressed review comments, please review!

- "_meta_.schema_version" provided type is integer

Note: Validation can be made more strict if a specific schema is defined for each index.
* Checks if mapping is a valid json
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] if you want to be fancy :D

/**
     * Checks if the provided mapping is a valid JSON and validates it against a schema.
     *
     * <p>The schema is located at <code>mappings/schema.json</code> and enforces the following validations:</p>
     *
     * <ul>
     *   <li>Mandatory fields:
     *     <ul>
     *       <li><code>_meta</code></li>
     *       <li><code>_meta.schema_version</code></li>
     *       <li><code>properties</code></li>
     *     </ul>
     *   </li>
     *   <li>No additional fields are allowed at the root level.</li>
     *   <li>No additional fields are allowed in the <code>_meta</code> object.</li>
     *   <li><code>properties</code> must be an object type.</li>
     *   <li><code>_meta</code> must be an object type.</li>
     *   <li><code>_meta.schema_version</code> must be an integer.</li>
     * </ul>
     *
     * <p><strong>Note:</strong> Validation can be made stricter if a specific schema is defined for each index.</p>
     */

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2024-12-27 at 3 04 50 PM It will look like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that looks neat, ive added it

Signed-off-by: Pavan Yekbote <[email protected]>
@pyek-bot pyek-bot had a problem deploying to ml-commons-cicd-env-require-approval December 27, 2024 23:17 — with GitHub Actions Failure
@pyek-bot pyek-bot temporarily deployed to ml-commons-cicd-env-require-approval December 27, 2024 23:17 — with GitHub Actions Inactive
@pyek-bot pyek-bot had a problem deploying to ml-commons-cicd-env-require-approval December 30, 2024 17:52 — with GitHub Actions Failure
@pyek-bot pyek-bot had a problem deploying to ml-commons-cicd-env-require-approval December 30, 2024 19:11 — with GitHub Actions Failure
@pyek-bot pyek-bot temporarily deployed to ml-commons-cicd-env-require-approval December 30, 2024 21:41 — with GitHub Actions Inactive
@pyek-bot pyek-bot temporarily deployed to ml-commons-cicd-env-require-approval December 30, 2024 21:41 — with GitHub Actions Inactive
@pyek-bot pyek-bot temporarily deployed to ml-commons-cicd-env-require-approval December 30, 2024 22:44 — with GitHub Actions Inactive
@dhrubo-os dhrubo-os merged commit 374cfd5 into opensearch-project:main Dec 31, 2024
7 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 31, 2024
#3240)

* feat(index mappings): fetch mappings and version from json file instead of string constants

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: changing exception being thrown

Signed-off-by: Pavan Yekbote <[email protected]>

* chore: remove unused file

Signed-off-by: Pavan Yekbote <[email protected]>

* chore: fix typo in comment

Signed-off-by: Pavan Yekbote <[email protected]>

* chore: adding new line at the end of files

Signed-off-by: Pavan Yekbote <[email protected]>

* feat: add test cases

Signed-off-by: Pavan Yekbote <[email protected]>

* fix: remove test code

Signed-off-by: Pavan Yekbote <[email protected]>

* fix(test): in main the versions were not updated appropriately

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: move mapping templates under common module

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: ensure that conversationindexconstants reference mlindex enums rather than use their own mappings

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: update comment

Signed-off-by: Pavan Yekbote <[email protected]>

* feat: add enhancements to validate index schema and allow using placeholders

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: modifying comment

Signed-off-by: Pavan Yekbote <[email protected]>

* test: adding testcase for MLIndex to catch failures before runtime

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: rename dir from mappings to index-mappings

Signed-off-by: Pavan Yekbote <[email protected]>

* fix: add null checks

Signed-off-by: Pavan Yekbote <[email protected]>

* fix: modify mappin paths for placeholders

Signed-off-by: Pavan Yekbote <[email protected]>

* fix: adding dependencies for testing

Signed-off-by: Pavan Yekbote <[email protected]>

* fix(test): compare json object rather than strings to avoid eol character issue

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: combine if statements into single check

Signed-off-by: Pavan Yekbote <[email protected]>

* refactoring: null handling + clean code

Signed-off-by: Pavan Yekbote <[email protected]>

* spotless apply

Signed-off-by: Pavan Yekbote <[email protected]>

* tests: adding more UT

Signed-off-by: Pavan Yekbote <[email protected]>

* fix: dependencies to handle jarhell

Signed-off-by: Pavan Yekbote <[email protected]>

* spotless apply

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: add header and use single instance of mapper

Signed-off-by: Pavan Yekbote <[email protected]>

* fixed: doc syntax

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: renamed files, efficient loading of resources, better exception handling

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: cleaner comment

Signed-off-by: Pavan Yekbote <[email protected]>

---------

Signed-off-by: Pavan Yekbote <[email protected]>
(cherry picked from commit 374cfd5)
pyek-bot added a commit to pyek-bot/ml-commons that referenced this pull request Jan 7, 2025
opensearch-project#3240)

* feat(index mappings): fetch mappings and version from json file instead of string constants

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: changing exception being thrown

Signed-off-by: Pavan Yekbote <[email protected]>

* chore: remove unused file

Signed-off-by: Pavan Yekbote <[email protected]>

* chore: fix typo in comment

Signed-off-by: Pavan Yekbote <[email protected]>

* chore: adding new line at the end of files

Signed-off-by: Pavan Yekbote <[email protected]>

* feat: add test cases

Signed-off-by: Pavan Yekbote <[email protected]>

* fix: remove test code

Signed-off-by: Pavan Yekbote <[email protected]>

* fix(test): in main the versions were not updated appropriately

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: move mapping templates under common module

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: ensure that conversationindexconstants reference mlindex enums rather than use their own mappings

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: update comment

Signed-off-by: Pavan Yekbote <[email protected]>

* feat: add enhancements to validate index schema and allow using placeholders

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: modifying comment

Signed-off-by: Pavan Yekbote <[email protected]>

* test: adding testcase for MLIndex to catch failures before runtime

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: rename dir from mappings to index-mappings

Signed-off-by: Pavan Yekbote <[email protected]>

* fix: add null checks

Signed-off-by: Pavan Yekbote <[email protected]>

* fix: modify mappin paths for placeholders

Signed-off-by: Pavan Yekbote <[email protected]>

* fix: adding dependencies for testing

Signed-off-by: Pavan Yekbote <[email protected]>

* fix(test): compare json object rather than strings to avoid eol character issue

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: combine if statements into single check

Signed-off-by: Pavan Yekbote <[email protected]>

* refactoring: null handling + clean code

Signed-off-by: Pavan Yekbote <[email protected]>

* spotless apply

Signed-off-by: Pavan Yekbote <[email protected]>

* tests: adding more UT

Signed-off-by: Pavan Yekbote <[email protected]>

* fix: dependencies to handle jarhell

Signed-off-by: Pavan Yekbote <[email protected]>

* spotless apply

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: add header and use single instance of mapper

Signed-off-by: Pavan Yekbote <[email protected]>

* fixed: doc syntax

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: renamed files, efficient loading of resources, better exception handling

Signed-off-by: Pavan Yekbote <[email protected]>

* refactor: cleaner comment

Signed-off-by: Pavan Yekbote <[email protected]>

---------

Signed-off-by: Pavan Yekbote <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Enhancement] Use index mapping json file
4 participants